Using the hdfs fsck command to view the blocks and block locations of an HDFS file

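HDFS provides the fsck command for checking the health of files and directories on HDFS and for retrieving a file's block information and block locations.
fsck should be run as the HDFS superuser; an ordinary user generally lacks the required permissions.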

[hadoop@hadoop ~]$ hdfs fsck
Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks | -replicaDetails | -upgradedomains]]]] [-includeSnapshots] [-storagepolicies] [-blockId <blk_Id>]
	<path>	start checking from this path
	-move	move corrupted files to /lost+found
	-delete	delete corrupted files
	-files	print out files being checked
	-openforwrite	print out files opened for write
	-includeSnapshots	include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
	-list-corruptfileblocks	print out list of missing blocks and files they belong to
	-files -blocks	print out block report
	-files -blocks -locations	print out locations for every block
	-files -blocks -racks	print out network topology for data-node locations
	-files -blocks -replicaDetails	print out each replica details 
	-files -blocks -upgradedomains	print out upgrade domains for every block
	-storagepolicies	print out storage policy summary for the blocks
	-blockId	print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)

Please Note:
	1. By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually  tagged CORRUPT or HEALTHY depending on their block allocation status
	2. Option -includeSnapshots should not be used for comparing stats, should be used only for HEALTH check, as this may contain duplicates if the same file present in both original fs tree and inside snapshots.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
command [genericOptions] [commandOptions]

Examples:

List the corrupt blocks under a path (-list-corruptfileblocks)
[hadoop@hadoop ~]$ hdfs fsck /home/hadoop/clear/day=20180717/ -list-corruptfileblocks

Move corrupt files to /lost+found (-move)
[hadoop@hadoop ~]$ hdfs fsck /home/hadoop/clear/day=20180717/part-r-00000 -move

Delete corrupt files (-delete)
[hadoop@hadoop ~]$ hdfs fsck /home/hadoop/clear/day=20180717/part-r-00000 -delete

Check and list the status of every file (-files)
[hadoop@hadoop ~]$ hdfs fsck /home/hadoop/clear/day=20180717/ -files

Check and print the files currently open for write (-openforwrite)
[hadoop@hadoop ~]$ hdfs fsck /home/hadoop/clear/day=20180717/ -openforwrite

Print the block report of a file (-blocks); must be used together with -files.
[hadoop@hadoop ~]$ hdfs fsck /home/hadoop/clear/day=20180717/part-r-00000 -files -blocks
Connecting to namenode via http://hadoop:50070/fsck?ugi=hadoop&files=1&blocks=1&path=%2Fhome%2Fhadoop%2Fclear%2Fday%3D20180717%2Fpart-r-00000
FSCK started by hadoop (auth:SIMPLE) from /192.168.232.8 for path /home/hadoop/clear/day=20180717/part-r-00000 at Mon Apr 01 14:48:16 CST 2019
/home/hadoop/clear/day=20180717/part-r-00000 72432 bytes, 1 block(s): OK
0. BP-2127332931-192.168.232.8-1545632462593:blk_1073741866_1042 len=72432 Live_repl=1

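In this output, "/home/hadoop/clear/day=20180717/part-r-00000 72432 bytes, 1 block(s):" gives the file's total size and its number of blocks.
0. BP-2127332931-192.168.232.8-1545632462593:blk_1073741866_1042 len=72432 Live_repl=1

The leading 0 is the index of the block within the file; a file with 56 blocks would have indexes 0 through 55.

BP-2127332931-192.168.232.8-1545632462593:blk_1073741866_1042 is the block ID: the block pool ID, followed by the block name with its generation stamp.

len=72432 is the length of this block in bytes.

Live_repl=1 is the number of live replicas of this block.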
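The field layout described above can be illustrated with a small parsing sketch. The helper name parse_block_line is hypothetical (it is not part of Hadoop), and the sketch assumes the line has exactly the four space-separated fields shown in the sample output.

```shell
# parse_block_line is a hypothetical helper, not a Hadoop command.
# It splits one fsck block line into index, block pool ID, block name,
# length in bytes and number of live replicas.
parse_block_line() {
    printf '%s\n' "$1" | awk '{
        idx = $1; sub(/\.$/, "", idx)            # "0." -> "0"
        split($2, id, ":")                       # block pool ID : block name
        len = $3; sub(/^len=/, "", len)
        repl = $4; sub(/^Live_repl=/, "", repl)
        printf "index=%s pool=%s block=%s len=%s live_replicas=%s\n", idx, id[1], id[2], len, repl
    }'
}

# Sample line copied from the fsck output above.
parse_block_line '0. BP-2127332931-192.168.232.8-1545632462593:blk_1073741866_1042 len=72432 Live_repl=1'
# prints: index=0 pool=BP-2127332931-192.168.232.8-1545632462593 block=blk_1073741866_1042 len=72432 live_replicas=1
```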

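Print the location of each block (-locations); must be used together with -files -blocks.
[hadoop@hadoop ~]$ hdfs fsck /home/hadoop/clear/test.log -files -blocks -locations
Compared with the block report, each block line now also carries the block's location.

Print the rack of each block location (-racks)
[hadoop@hadoop ~]$ hdfs fsck /home/hadoop/clear/test.log -files -blocks -locations -racks
Compared with the previous output, rack information is added.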

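Recovering HDFS blocks corrupted by a power failure

[Reposted from the Ruoze Big Data (若泽大数据) documentation]

1. Symptom:
After a power failure the HDFS service misbehaves or reports corrupt blocks.

2. Check the health of the HDFS file system
hdfs fsck /

3. List the corrupt blocks with hdfs fsck -list-corruptfileblocks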
Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The list of corrupt files under path '/' are:
blk_1075229920 /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd
blk_1075229921 /hbase/data/JYDW/WMS_PO_ITEMS/c96cb6bfef12795181c966a8fc4ef91d/0/cf44ae0411824708bf6a894554e19780
The filesystem under path '/' has 2 CORRUPT files
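When the list of corrupt files is long, it can help to reduce this output to the bare file paths before deciding what to do with them. The following is a minimal sketch; corrupt_paths is a hypothetical helper, and it assumes that every data line starts with a blk_ ID followed by a path that contains no spaces.

```shell
# corrupt_paths is a hypothetical helper, not a Hadoop command.
# It reads -list-corruptfileblocks output on stdin and prints only the
# file paths, one per line. Assumes paths contain no spaces.
corrupt_paths() {
    awk '$1 ~ /^blk_[0-9]+$/ { print $2 }'
}

# Sample input copied from the output above. On a live cluster one would
# pipe `hdfs fsck / -list-corruptfileblocks` into corrupt_paths instead.
corrupt_paths <<'EOF'
The list of corrupt files under path '/' are:
blk_1075229920 /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd
blk_1075229921 /hbase/data/JYDW/WMS_PO_ITEMS/c96cb6bfef12795181c966a8fc4ef91d/0/cf44ae0411824708bf6a894554e19780
The filesystem under path '/' has 2 CORRUPT files
EOF
```

For the sample input this prints the two /hbase/data/JYDW/WMS_PO_ITEMS/... paths and drops the header and summary lines.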

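4. Analysis: if you know where the file came from, that is, it can be reloaded from another system, you can simply delete the corrupt file from HDFS.
MySQL -> big data platform
In that case it is enough to reload this table's data from MySQL into HDFS.

5. What if you want to know which blocks of the file sit on which machines, so that you can delete the Linux files under /dfs/dn/….. by hand?
hadoop36:hdfs:/var/lib/hadoop-hdfs:>

-files      prints the files being checked
-blocks     prints the block report; only takes effect together with -files
-locations  prints the DataNode address of every block; only takes effect together with -blocks
-racks      prints the rack (network topology) of each DataNode location; only takes effect together with -files -blocks

For the corrupt file no locations are shown, so the block file cannot be deleted by hand: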
hdfs fsck /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd -files -locations -blocks -racks
Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&locations=1&blocks=1&files=1&path=%2Fhbase%2Fdata%2FJYDW%2FWMS_PO_ITEMS%2Fc71f5f49535e0728ca72fd1ad0166597%2F0%2Ff4d3d97bb3f64820b24cd9b4a1af5cdd
FSCK started by hdfs (auth:SIMPLE) from /192.168.1.100 for path /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd at Sat Jan 20 15:46:55 CST 2018
/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd 2934 bytes, 1 block(s):
/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd: CORRUPT blockpool BP-1437036909-192.168.1.100-1509097205664 block blk_1075229920
MISSING 1 blocks of total size 2934 B
0. BP-1437036909-192.168.1.100-1509097205664:blk_1075229920_1492007 len=2934 MISSING!

Status: CORRUPT
Total size: 2934 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 2934 B)

UNDER MIN REPL'D BLOCKS: 1 (100.0 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 2934 B
CORRUPT BLOCKS: 1

Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 1
Missing replicas: 0
Number of data-nodes: 12
Number of racks: 1
FSCK ended at Sat Jan 20 15:46:55 CST 2018 in 0 milliseconds

The filesystem under path '/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd' is CORRUPT
hadoop36:hdfs:/var/lib/hadoop-hdfs:>

For a healthy file, the block distribution is shown:
hadoop36:hdfs:/var/lib/hadoop-hdfs:>hdfs fsck /hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 -files -locations -blocks -racks
Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&files=1&locations=1&blocks=1&racks=1&path=%2Fhbase%2Fdata%2FJYDW%2FWMS_TO%2F011dea9ae46dae6c1f1f3a24a75af100%2F0%2F1d60f56773984e4cac614a8b5f7e93a6
FSCK started by hdfs (auth:SIMPLE) from /192.168.1.100 for path /hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 at Sat Jan 20 15:58:25 CST 2018
/hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 1697 bytes, 1 block(s): OK
0. BP-1437036909-192.168.1.100-1509097205664:blk_1075227504_1489591 len=1697 Live_repl=3 [/default/192.168.1.150:50010, /default/192.168.1.153:50010, /default/192.168.1.145:50010]

blk_1075227504_1489591 len=1697 Live_repl=3
[/default/192.168.1.150:50010, /default/192.168.1.153:50010, /default/192.168.1.145:50010]
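To turn the bracketed list at the end of a healthy block line into one DataNode per line, a sketch along these lines may be enough. replica_locations is a hypothetical helper; it assumes the -racks format shown above (a single pair of square brackets), and it would not handle the nested DatanodeInfoWithStorage[...] entries that some versions print for -locations.

```shell
# replica_locations is a hypothetical helper, not a Hadoop command.
# It prints the /rack/host:port entries of one fsck block line, one per line.
# Assumes the line contains exactly one [...] list, as in the -racks output.
replica_locations() {
    printf '%s\n' "$1" |
        sed -n 's/.*\[\(.*\)\].*/\1/p' |   # keep only the text inside [...]
        tr ',' '\n' |                      # one entry per line
        sed 's/^ *//'                      # drop the space after each comma
}

# Sample line copied from the output above.
replica_locations '0. BP-1437036909-192.168.1.100-1509097205664:blk_1075227504_1489591 len=1697 Live_repl=3 [/default/192.168.1.150:50010, /default/192.168.1.153:50010, /default/192.168.1.145:50010]'
# prints:
# /default/192.168.1.150:50010
# /default/192.168.1.153:50010
# /default/192.168.1.145:50010
```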

6. In the end we took the once-and-for-all route: delete the corrupt files, then have the business system reload the data
hadoop36:hdfs:/var/lib/hadoop-hdfs:>hdfs fsck / -delete

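7. Suppose the data exists only on HDFS [no other source has the file, and some of its replicas are intact while others are corrupt]
7.1 hdfs dfs -ls /xxxx
hdfs dfs -get /xxxx ./      download an intact copy to the local Linux file system
hdfs dfs -rm /xxx           delete the existing file, including its corrupt replicas
hdfs dfs -put xxx /         upload the intact copy again; HDFS then restores the full 3 replicas automatically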
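The get / rm / put sequence above could be wrapped in a small function so that a later step only runs when the previous one succeeded. This is only a sketch: reupload_file and the HDFS_CMD variable are hypothetical names introduced here, and setting HDFS_CMD=echo gives a dry run that prints the commands instead of executing them.

```shell
# HDFS_CMD defaults to the real client; set HDFS_CMD=echo for a dry run.
HDFS_CMD="${HDFS_CMD:-hdfs}"

# reupload_file is a hypothetical helper, not a Hadoop command.
# $1 = HDFS path of the affected file
# $2 = local scratch directory with enough free space
reupload_file() {
    src=$1
    tmp=$2/$(basename "$src")
    $HDFS_CMD dfs -get "$src" "$tmp" &&   # download an intact copy
    $HDFS_CMD dfs -rm "$src" &&           # delete the file, corrupt replicas included
    $HDFS_CMD dfs -put "$tmp" "$src"      # upload it again; HDFS re-replicates it
}

# Example dry run:
#   HDFS_CMD=echo
#   reupload_file /xxxx/data.log /tmp
```

Chaining the steps with && means the file is never removed from HDFS unless the download succeeded first.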

In practice: losing a little log data does not matter, but if the lost file holds business or order data, it must be reported.

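Repairing corrupt blocks manually [hdfs debug]

The hdfs command help does not list debug, but the hdfs debug subcommand does exist; keep it in mind.

hdfs debug recoverLease -path <HDFS file path> -retries 10

Automatic repair

After a block is corrupted, the DataNode does not notice the damage until it runs its next directory scan, and the directory scan runs every 6 hours (the value is in seconds):
dfs.datanode.directoryscan.interval : 21600
The block is not restored until the DataNode sends its next block report to the NameNode, and the block report is also sent every 6 hours (the value is in milliseconds):
dfs.blockreport.intervalMsec : 21600000
Only when the NameNode receives the block report does it start the recovery.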
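The two settings use different units, which is easy to overlook. The arithmetic below, a small sketch using only the values quoted above, confirms that both correspond to 6 hours; the variable names are made up for this example.

```shell
# Values quoted in the text above.
directoryscan_interval_s=21600      # dfs.datanode.directoryscan.interval, in seconds
blockreport_interval_ms=21600000    # dfs.blockreport.intervalMsec, in milliseconds

echo "directory scan every $((directoryscan_interval_s / 3600)) h"
echo "block report every $((blockreport_interval_ms / 1000 / 3600)) h"
# prints:
# directory scan every 6 h
# block report every 6 h

# On a cluster, the effective value of a key can be read with, for example:
#   hdfs getconf -confKey dfs.blockreport.intervalMsec
```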

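Note: the manual approach requires that you first delete the corrupt block by hand.
Remember: delete the corrupt block file and its meta file on the DataNode's local disk, not the HDFS file.
Alternatively, get the file first, delete it from HDFS, and then put it back.
Do not delete with hdfs fsck / -delete: that removes the corrupt files themselves, so the data is lost, unless losing it does not matter or you are confident you can reload it into HDFS from elsewhere.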

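Finding the local storage path of a block

First run hdfs fsck / to find where the corrupt block is.
[Here we go straight to hdfs fsck /blockrecover/test.txt -files -blocks -locations to see the block locations of one file]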

[hadoop@hadoop01 ~]$ hdfs fsck /blockrecover/test.txt -files -blocks -locations 
Connecting to namenode via http://hadoop02:50070
FSCK started by hadoop (auth:SIMPLE) from /192.168.232.5 for path /blockrecover/test.txt at Mon Apr 01 22:02:46 CST 2019
/blockrecover/test.txt 5 bytes, 1 block(s):  OK
0. BP-217229950-192.168.232.5-1554022618135:blk_1073741843_1019 len=5 Live_repl=3 [DatanodeInfoWithStorage[192.168.232.5:50010,DS-dbe82ccf-7d44-40ba-bd14-ae9753139ff0,DISK], DatanodeInfoWithStorage[192.168.232.6:50010,DS-b0ce0673-18b9-4d65-8a56-eac48428a5a1,DISK], DatanodeInfoWithStorage[192.168.232.7:50010,DS-ba099ac3-e054-46c7-8eb8-1f80feefb6ca,DISK]]

Status: HEALTHY
 Total size:	5 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 5 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		3
 Number of racks:		1
FSCK ended at Mon Apr 01 22:02:46 CST 2019 in 2 milliseconds


The filesystem under path '/blockrecover/test.txt' is HEALTHY

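0. BP-217229950-192.168.232.5-1554022618135:blk_1073741843_1019 len=5 Live_repl=3
This line is enough to locate a replica: on each DataNode listed, the block pool ID is a directory name under the data directory and the block name is the file name.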

[hadoop@hadoop03 subdir0]$ pwd
/home/hadoop/app/hadoop/data/dfs/data/current/BP-217229950-192.168.232.5-1554022618135/current/finalized/subdir0/subdir0
[hadoop@hadoop03 subdir0]$ ll
total 544
-rw-rw-r--. 1 hadoop hadoop  67405 Mar 31 17:33 blk_1073741838
-rw-rw-r--. 1 hadoop hadoop    535 Mar 31 17:33 blk_1073741838_1014.meta
-rw-rw-r--. 1 hadoop hadoop 115580 Mar 31 17:33 blk_1073741839
-rw-rw-r--. 1 hadoop hadoop    911 Mar 31 17:33 blk_1073741839_1015.meta
-rw-rw-r--. 1 hadoop hadoop 221195 Mar 31 17:33 blk_1073741840
-rw-rw-r--. 1 hadoop hadoop   1739 Mar 31 17:33 blk_1073741840_1016.meta
-rw-rw-r--. 1 hadoop hadoop  58897 Mar 31 17:33 blk_1073741841
-rw-rw-r--. 1 hadoop hadoop    471 Mar 31 17:33 blk_1073741841_1017.meta
-rw-rw-r--. 1 hadoop hadoop  59669 Mar 31 17:33 blk_1073741842
-rw-rw-r--. 1 hadoop hadoop    475 Mar 31 17:33 blk_1073741842_1018.meta
-rw-rw-r--. 1 hadoop hadoop      5 Apr  1 21:54 blk_1073741843
-rw-rw-r--. 1 hadoop hadoop     11 Apr  1 21:54 blk_1073741843_1019.meta

[Suppose this is the corrupt block.] You can then run hdfs debug recoverLease -path /blockrecover/test.txt -retries 10 to attempt the repair and bring the file back to three intact replicas.
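Rather than walking the subdir tree by hand, the block file and its meta file can be searched for under the block pool directory. The sketch below assumes the directory layout shown in the pwd output above (data directory, then current/<block pool ID>/current/finalized); find_block_files is a hypothetical helper.

```shell
# find_block_files is a hypothetical helper, not a Hadoop command.
# $1 = DataNode data directory (the value of dfs.datanode.data.dir)
# $2 = block pool ID, e.g. BP-217229950-192.168.232.5-1554022618135
# $3 = block name without generation stamp, e.g. blk_1073741843
# Prints the block file and its .meta file, if present on this DataNode.
find_block_files() {
    find "$1/current/$2/current/finalized" -type f \
        \( -name "$3" -o -name "${3}_*.meta" \) | sort
}

# Example, using the paths from the listing above:
#   find_block_files /home/hadoop/app/hadoop/data/dfs/data \
#       BP-217229950-192.168.232.5-1554022618135 blk_1073741843
```

Matching the exact block name plus "_*.meta" avoids picking up other blocks whose IDs merely share the same prefix.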
